-
Notifications
You must be signed in to change notification settings - Fork 221
[LibOS] execve: eliminate race on clear_child_tid before VMAs deallocation
#2149
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Conversation
… threads’ clear_child_tid before unmapping VMAs Signed-off-by: Wenhan Ji <[email protected]>
|
Jenkins, test this please |
|
Jenkins, retest this please (All failures seem to be connectivity issues.) |
Could someone kindly help me trigger a Jenkins retest? My retest command doesn’t seem to work—possibly due to a permission issue. I also don’t have access to the build logs. |
|
Jenkins, retest this please |
|
Add to whitelist |
donporter
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 4 of 4 files at r1, all commit messages.
Reviewable status: all files reviewed, 1 unresolved discussion, not enough approvals from maintainers (2 more required), not enough approvals from different teams (2 more required, approved so far: )
libos/src/bookkeep/libos_thread.c line 567 at r1 (raw file):
free_threads_array(threads, count); count = needed_count;
Minor: assert needed_count > count?
This is a clever way to do things, but a comment or two would help the reader understand why this loop terminates (i.e., that count monotonically increases to match the actual thread count, and new threads should not be created at this point).
Or, consider restructuring as a do loop, with needed_count <= count being the loop condition.
|
The deb job failed with: The repository 'http://deb.debian.org/debian bullseye-backports Release' no longer has a Release file. This looks legit, and unrelated to this PR. |
…r other threads’ clear_child_tid before unmapping VMAs Signed-off-by: Wenhan Ji <[email protected]>
forkthus
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: 3 of 4 files reviewed, all discussions resolved, not enough approvals from maintainers (2 more required), not enough approvals from different teams (2 more required, approved so far: ), "fixup! " found in commit messages' one-liners (waiting on @donporter)
libos/src/bookkeep/libos_thread.c line 567 at r1 (raw file):
Previously, donporter (Don Porter) wrote…
Minor: assert needed_count > count?
This is a clever way to do things, but a comment or two would help the reader understand why this loop terminates (i.e., that count monotonically increases to match the actual thread count, and new threads should not be created at this point).
Or, consider restructuring as a do loop, with needed_count <= count being the loop condition.
Done.
I added comments explaining why the loop terminates (count increases monotonically to the needed size) and an assert on the retry path.
I kept the structure to match the existing style—this pattern mirrors dump_vmas() in libos_vma.c.
|
Jenkins, retest this please. Just seeing if the webhook gets reactivated. |
Description of the changes
Fixes #2148
This PR makes
execve()wait until every sibling thread’s*clear_child_tidis zeroed before deallocating their VMAs.Implementation details:
g_thread_listas soon as the calling thread acquiresfirst.If
*clear_child_tid != 0, invokefutex_wait(); it will be awakened byrelease_clear_child_tid()viafutex_wake().*clear_child_tidare cleared, the calling thread then starts to deallocate VMAs.How to test this PR?
Repeating
gramine-sgx exec_same [args_#1...args_#49]Without this PR – the test usually fails within a few minutes, especially on the branch of PR #1795 because of the issue mentioned above. The main branch takes longer to fail.
With this PR applied – the same loop runs for hours without any failures on the branch of PR #1795 .
This change is